{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3.7.9 64-bit ('DataAnalysis': conda)",
   "metadata": {
    "interpreter": {
     "hash": "dcf060ad4b7fa10e323b7c0259734b0cdb630d87fa5c1539ce2c08c5bd388944"
    }
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "import nltk\n",
    "from nltk.corpus import brown\n",
    "from nltk.corpus import names\n",
    "\n",
    "from tools import show_subtitle"
   ]
  },
  {
   "source": [
    "# Chap6 学习分类文本\n",
    "\n",
    "学习目标：\n",
    "\n",
    "1.  识别出语言数据中可以用于分类的特征\n",
    "2.  构建用于自动执行语言处理任务的语言模型\n",
    "3.  从语言模型中学习与语言相关的知识"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 6.3 评估 (Evaluation)"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 6.3.1 测试集"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 这种方式构建的数据集会导致训练集与测试集中的句子取自同一篇文章中，结果就是句子的风格相对一致\n",
    "# 从而产生过拟合（即泛化准确率过高，与实际情况不符），影响分类器向其他数据集的推广\n",
    "def split_all():\n",
    "    show_subtitle(\"直接基于所有数据进行分割\")\n",
    "    tagged_sents = list(brown.tagged_sents(categories='news'))\n",
    "    random.shuffle(tagged_sents)\n",
    "    size = int(len(tagged_sents) * 0.1)\n",
    "    train_set, test_set = tagged_sents[size:], tagged_sents[:size]\n",
    "    return train_set, test_set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 这种方式从文章层次将数据分开，就不会出现上面分解数据时出现的问题\n",
    "def split_file_ids():\n",
    "    show_subtitle(\"基于文章名称进行数据分割\")\n",
    "    file_ids = brown.fileids(categories='news')\n",
    "    size = int(len(file_ids) * 0.1)\n",
    "    train_set, test_set = brown.tagged_sents(file_ids[size:]), brown.tagged_sents(file_ids[:size])\n",
    "    return train_set, test_set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 直接从不同的类型中取数据，效果更好\n",
    "def split_categories():\n",
    "    show_subtitle(\"基于文章类型进行数据分割\")\n",
    "    train_set, test_set = brown.tagged_sents(categories='news'), brown.tagged_sents(categories='fiction')\n",
    "    return train_set, test_set"
   ]
  },
  {
   "source": [
    "### 6.3.2 精确度"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 原始数据集合\n",
    "labeled_names = (\n",
    "        [\n",
    "            (name, 'male')\n",
    "            for name in names.words('male.txt')\n",
    "        ]\n",
    "        +\n",
    "        [\n",
    "            (name, 'female')\n",
    "            for name in names.words('female.txt')\n",
    "        ]\n",
    ")\n",
    "# 乱序排序数据集\n",
    "random.shuffle(labeled_names)\n",
    "\n",
    "def gender_features(word):\n",
    "    return {'prefix1': word[0:1], 'prefix2': word[0:2], 'suffix1': word[-1:], 'suffix2': word[-2:]}\n",
    "\n",
    "\n",
    "feature_sets = [(gender_features(n), gender) for (n, gender) in labeled_names]\n",
    "train_set, test_set = feature_sets[500:], feature_sets[:500]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "Accuracy: 0.81\n"
     ]
    }
   ],
   "source": [
    "# 准确度：用于评估分类的质量\n",
    "# 这里不能使用从 brown 提供的数据集，应该使用的第1节中数据集（名字：性别）来训练和测试\n",
    "classifier = nltk.NaiveBayesClassifier.train(train_set)\n",
    "print('Accuracy: {:4.2f}'.format(nltk.classify.accuracy(classifier, test_set)))"
   ]
  },
  {
   "source": [
    "### 6.3.3 精确度 和 召回率\n",
    "-   真阳性：是相关项目中正确识别为相关的（True Positive，TP）\n",
    "-   真阴性：是不相关项目中正确识别为不相关的（True Negative，TN）\n",
    "-   假阳性：（I型错误）是不相关项目中错误识别为相关的（False Positive，FP）\n",
    "-   假阴性：（II型错误）是相关项目中错误识别为不相关的（False Negative，FN）\n",
    "-   精确度：（Precision）表示发现的项目中有多少是相关的，TP/（TP+FP）\n",
    "-   召回率：（Recall）表示相关的项目中发现了多少，TP/（TP+FN）\n",
    "-   F-度量值（F-Measure）：也叫F-得分（F-Score），组合精确度和召回率为一个单独的得分\n",
    "      被定义为精确度和召回率的调和平均数(2*Precision*Recall)/(Precision+Recall)=2*/(1/Precision + 1/Recall)"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 6.3.4 混淆矩阵"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将标记好的句子转化成列表\n",
    "def tag_list(tagged_sents):\n",
    "    return [\n",
    "        tag\n",
    "        for sent in tagged_sents\n",
    "        for (word, tag) in sent\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "def apply_tagger(tagger, corpus):\n",
    "    return [\n",
    "        # 删除已经标注的标签，方便测试\n",
    "        tagger.tag(nltk.tag.untag(sent))\n",
    "        # nltk.tag.untag() 只有对句子进行去标签处理\n",
    "        for sent in corpus\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "     |                             C           N           P           V       |\n     |           A     A     A     O     D     O     N     R     P     E       |\n     |           D     D     D     N     E     U     U     O     R     R       |\n     |     .     J     P     V     J     T     N     M     N     T     B     X |\n-----+-------------------------------------------------------------------------+\n   . | <7099>    .     .     .     .     .     .     .     .     .     .     . |\n ADJ |     . <4803>    1    81     .     .    66     .     .     .     7     . |\n ADP |     .     3 <6984>   32    14     8     1     .     .   571     .     . |\n ADV |     .    81   105 <2766>    4    10     1     .     .    29     1     . |\nCONJ |     .     .     1     5 <1854>    2     .     .     .     .     .     . |\n DET |     .     .    51     3     . <7360>    1     .     1     .     .     . |\nNOUN |     1    57     1     2     .     .<15026>   23     .     .    60     . |\n NUM |     .     .     .     .     .     .     .  <722>    .     .     .     . |\nPRON |     .     .    96     .     .    14     .     . <2181>    .     .     . |\n PRT |     .    13    82    11     .     .     .     .     . <1429>    1     . |\nVERB |     .     9    15     2     .     .    68     .     .     2 <9800>    . |\n   X |     .     .     .     .     .     .     .     .     .     .     .   <44>|\n-----+-------------------------------------------------------------------------+\n(row = reference; col = test)\n\n"
     ]
    }
   ],
   "source": [
    "train_sents = brown.tagged_sents(categories='editorial', tagset='universal')\n",
    "t0 = nltk.DefaultTagger('NN')\n",
    "t1 = nltk.UnigramTagger(train_sents, backoff=t0)\n",
    "t2 = nltk.BigramTagger(train_sents, backoff=t1)\n",
    "gold = tag_list(train_sents)\n",
    "test = tag_list(apply_tagger(t2, train_sents))\n",
    "confusion_matrix = nltk.ConfusionMatrix(gold, test)\n",
    "print(confusion_matrix)"
   ]
  },
  {
   "source": [
    "### 6.3.4 交叉验证\n",
    "将原始语料细分为N个子集，在不同的测试集上执行多重评估，然后组合这些评估的得分。\n",
    "\n",
    "交叉验证的作用：\n",
    "\n",
    "1.  解决数据集合过小的问题，\n",
    "2.  研究不同的训练集上性能变化有多大"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "接下来的三节中，将研究三种机器学习分类模型：决策树、朴素贝叶斯分类器 和 最大熵分类器\n",
    "\n",
    "仔细研究这些分类器的收获：\n",
    "\n",
    "1.  如何基于一个训练集上的数据来选择合适的学习模型\n",
    "2.  如何选择合适的特征应用到相关的学习模型上\n",
    "3.  如何提取和编码特征，使之包含最大的信息量，以及保持这些特征之间的相关性"
   ],
   "cell_type": "markdown",
   "metadata": {}
  }
 ]
}